
    GLAWI, a free XML-encoded Machine-Readable Dictionary built from the French Wiktionary

    This article introduces GLAWI, a large XML-encoded machine-readable dictionary automatically extracted from Wiktionnaire, the French edition of Wiktionary. GLAWI contains 1,341,410 articles and is released under a free license. Besides the size of its headword list, GLAWI inherits from Wiktionnaire its original macrostructure and the richness of its lexicographic descriptions: articles contain etymologies, definitions, usage examples, inflectional paradigms, lexical relations and phonemic transcriptions. The paper first gives some insights into the nature and content of Wiktionnaire, with a particular focus on its encoding format, before presenting our approach: the standardization of its microstructure and its conversion into XML. First intended to meet NLP needs, GLAWI has been used to create a number of customized lexicons dedicated to specific uses, including linguistic description and psycholinguistics. The main one is GLÀFF, a large inflectional and phonological lexicon of French. We show that many more specific, on-demand lexicons can easily be derived from the large body of lexical knowledge encoded in GLAWI.
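As a rough illustration of how such an XML-encoded article could be consumed programmatically, the sketch below parses a toy entry with Python's standard library. The element names (article, title, pos, pronunciation, definition) are illustrative assumptions, not GLAWI's actual schema:

```python
import xml.etree.ElementTree as ET

# A toy article in the spirit of an XML machine-readable dictionary entry.
# The element names below are assumptions for illustration only.
SAMPLE = """
<article>
  <title>chanteur</title>
  <pos>noun</pos>
  <pronunciation>\u0283\u0251\u0303.t\u0153\u0281</pronunciation>
  <definition>Personne qui chante.</definition>
</article>
"""

def read_article(xml_text):
    """Extract headword, part of speech, pronunciation and definition."""
    root = ET.fromstring(xml_text)
    return {
        "headword": root.findtext("title"),
        "pos": root.findtext("pos"),
        "pronunciation": root.findtext("pronunciation"),
        "definition": root.findtext("definition"),
    }

entry = read_article(SAMPLE)
```

A standardized microstructure like this is what makes it easy to derive customized lexicons: a filter over a stream of such articles is a few lines of code.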

    Wiktionnaire's Wikicode GLAWIfied: a Workable French Machine-Readable Dictionary

    GLAWI is a free, large-scale and versatile Machine-Readable Dictionary (MRD) that has been extracted from the French language edition of Wiktionary, called Wiktionnaire. In (Sajous and Hathout, 2015), we introduced GLAWI, gave the rationale behind the creation of this lexicographic resource and described the extraction process, focusing on the conversion and standardization of the heterogeneous data provided by this collaborative dictionary. In the current article, we describe the content of GLAWI and illustrate how it is structured. We also suggest various applications, ranging from linguistic studies and NLP applications to psycholinguistic experimentation; all of them can take advantage of the diversity of the lexical knowledge available in GLAWI. Besides this diversity and extensive lexical coverage, GLAWI is also remarkable because it is the only free lexical resource of contemporary French that contains definitions. This unique material opens the way to the renewal of MRD-based methods, notably the automated extraction and acquisition of semantic relations.

    Ne jetons pas le Wiktionnaire avec l'oripeau du Web ! Études et réalisations fondées sur le dictionnaire collaboratif

    Wiktionnaire is the French edition of Wiktionary, the free multilingual dictionary available online. A satellite of Wikipédia, of which it is the "lexical companion", the dictionary project remains in the encyclopedia's shadow. Like Wikipedia, it is founded on the wiki principle: it can be fed and modified by any Internet user, with immediate publication. While the encyclopedic resource has been used extensively in some disciplines, the collaborative dictionary seems to have received less attention from the scientific community. This lesser interest may stem from unfamiliarity with the resource, or from an a priori rejection of the amateurism readily associated with contributions made by non-specialists. In this article, we present some characteristics of Wiktionnaire, together with resources and applications built from it. This work aims to illustrate the possibilities offered by this singular dictionary and to help decide whether its exploitation is worthwhile, and for which uses. More precisely, we question the legitimacy of crowdsourced resources, and we examine to what extent Wiktionnaire can, through its specific features, both complement existing dictionary resources for linguistic studies and serve as a starting point for building an electronic lexicon for fields such as natural language processing and psycholinguistics. Our contribution to the characterization of Wiktionnaire comes with the release of two lexicons built from the collaborative dictionary. The first is a morphophonological lexicon with very large coverage, intended notably for NLP applications; we give examples of its possible uses in corpus-based linguistics. The second is a lexicon oriented towards psycholinguistics: derived from the first, it contains fewer entries, but includes for each of them a set of information usually used in that discipline. Both lexicons can be downloaded and queried online.

    Enrichissement de lexiques sémantiques approvisionnés par les foules : le systÚme WISIGOTH appliqué à Wiktionary

    Semantic lexical resources are a mainstay of various NLP applications. However, comprehensive and reliable resources rarely exist or are often not freely available. We discuss in this paper the context of lexical resource building and the problems of evaluation. We present Wiktionary, a freely available and collaboratively built multilingual dictionary, and we propose a semi-automatic approach based on random walks for enriching its synonymy network, using both endogenous and exogenous data. We then propose a validation "by crowds". Finally, we present an implementation of this system, called WISIGOTH.
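The random-walk idea can be sketched as a walk with restart over a small synonymy graph: words reachable in a few hops from a source word, but not yet linked to it, become candidate synonyms. The graph, restart probability and scoring below are toy assumptions, not the actual WISIGOTH pipeline:

```python
# Minimal random-walk-with-restart sketch over a toy synonymy graph.
# The graph and parameters are illustrative assumptions.

def walk_scores(graph, start, restart=0.15, steps=50):
    """Visit probabilities of a walker that restarts at `start`."""
    nodes = sorted(graph)
    prob = {n: 0.0 for n in nodes}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for n, p in prob.items():
            neigh = graph[n]
            for m in neigh:
                nxt[m] += (1 - restart) * p / len(neigh)
        nxt[start] += restart  # teleport back to the source word
        prob = nxt
    return prob

synonyms = {
    "car": ["auto", "vehicle"],
    "auto": ["car", "automobile"],
    "automobile": ["auto", "vehicle"],
    "vehicle": ["car", "automobile"],
}
scores = walk_scores(synonyms, "car")
```

Here "automobile", two hops from "car", receives a nonzero score and could be proposed as a new synonym edge, to be validated "by crowds" as described above.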

    Looking for French deverbal nouns in an evolving Web (a short history of WAC)

    This paper describes an 8-year-long research effort to automatically collect new French deverbal nouns on the Web. The goal has remained the same: building an extensive and cumulative list of noun-verb pairs in which the noun denotes the action expressed by the verb (e.g. production - produce). This list is used both for linguistic research and for NLP applications. The initial method consisted of taking advantage of the former Altavista search engine, which allowed direct access to unknown word forms. The second technique led us to develop a specific crawler, which raised a number of technical difficulties. In the third experiment, we used a collection of web pages made available to us by a commercial search engine. Through all these stages, the general method has remained the same, and the results are similar and cumulative, although the technical environment has greatly evolved.

    From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

    In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale, versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, counting more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 entries (120,000 lemmas) that are attested in corpora. We explain how the resources have been created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while being of comparable quality.

    Acquisition and enrichment of morphological and morphosemantic knowledge from the French Wiktionary

    We present two approaches to automatically acquiring morphologically related words from Wiktionary. Starting with related words explicitly mentioned in the dictionary, we propose a method based on orthographic similarity to detect new derived words in the entries' definitions, with an overall accuracy of 93.5%. Using word pairs from the initial lexicon as patterns of formal analogies to filter new derived words enables us to raise the accuracy to 99%, while extending the lexicon's size by 56%. In a final experiment, we show that it is possible to semantically type the morphological definitions, focusing on the detection of process nominals.
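The analogy-based filtering step can be sketched as follows: a known derivational pair such as (laver, lavage) serves as a pattern, and a candidate pair is kept only if it exhibits the same formal alternation. Real formal analogies are more general; this suffix-swap version is a simplified, illustrative approximation:

```python
# Simplified formal-analogy filter: accept a candidate word pair only if
# it shows the same prefix/suffix alternation as a known derivational pair.
# This is an illustrative approximation, not the paper's actual model.

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def alternation(a, b):
    """Return the suffix pair turning a into b, e.g. ('er', 'age')."""
    k = common_prefix_len(a, b)
    return a[k:], b[k:]

def same_pattern(known_pair, candidate_pair):
    return alternation(*known_pair) == alternation(*candidate_pair)

ok = same_pattern(("laver", "lavage"), ("coder", "codage"))
bad = same_pattern(("laver", "lavage"), ("coder", "codeur"))
```

Pairs that merely look orthographically similar but follow a different alternation (here codeur, an agent noun rather than an action noun) are filtered out, which is how the analogy step raises precision over similarity alone.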

    GLÀFF, un Gros Lexique À tout Faire du Français

    This paper introduces GLÀFF, a large-scale, versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonetic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating, and its copylefted license that makes it available for use, modification and redistribution. We explain how we have built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.

    Évaluation sur mesure de modÚles distributionnels sur un corpus spécialisé : comparaison des approches par contextes syntaxiques et par fenĂȘtres graphiques

    Distributional semantics models can be built using a simple bag-of-words representation of a word's contexts (window-based) or using more complex syntactic information (syntax-based). Previous studies have compared their relative efficiency without coming to a definitive conclusion, but such a comparison has never been performed on small, specialised corpora. We ran a set of comparative experiments based on a collection of French NLP articles and a custom-made gold standard. These experiments show a better overall performance of syntax-based models, as long as the syntactic information is processed with appropriate care.
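The window-based alternative described above can be sketched in a few lines: each word is represented by the bag of words occurring within a fixed window around it, and word similarity is the cosine between these count vectors. The corpus, window size and weighting below are toy assumptions, not the paper's experimental setup:

```python
from collections import Counter
from math import sqrt

# Minimal window-based distributional model: context = the words within
# +/-2 tokens; similarity = cosine between raw co-occurrence counts.
# Toy illustration only; real models use larger corpora and weighting.

def window_vectors(tokens, window=2):
    vectors = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ("the parser reads the sentence . "
          "the tagger reads the sentence .").split()
vecs = window_vectors(corpus)
sim = cosine(vecs["parser"], vecs["tagger"])
```

A syntax-based model would replace the raw window contexts with typed dependency contexts (e.g. subject-of-reads), which is where the "appropriate care" in processing syntactic information comes in.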

    Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

    We describe here the technical details of our participation in PAN 2012's "traditional" authorship attribution tasks. The main originality of our approach lies in the use of a large quantity of varied features to represent textual data, processed by a maximum entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques, as well as generic language resources such as lexicons and other linguistic databases. Some of the features were even designed specifically for the target data type (contemporary fiction). Our belief is that richer features, which integrate external knowledge about language, have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in the number of training items per target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved better than character trigrams and word frequencies, the most efficient features vary widely from task to task. For the intrusive-paragraphs tasks, we obtained better results (73% and 93%) while still using the maximum entropy engine as an unsupervised clustering tool.
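The character trigram features mentioned above, the knowledge-poor baseline against which the rich linguistic features are compared, can be sketched as a sliding window of three characters counted and normalized into relative frequencies. This is an illustrative baseline extractor, not the PAN system itself:

```python
from collections import Counter

# Character trigram profile: a knowledge-poor authorship feature set.
# Each text is mapped to the relative frequency of its 3-character windows.

def char_trigrams(text):
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

profile = char_trigrams("the cat sat on the mat")
```

Richer features would add, on top of such profiles, dimensions derived from NLP annotation (parts of speech, syntax) and from external lexicons, which is the contrast the paper investigates under scarce training data.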